[KYUUBI #7305] Fix JDBC engine returning tables from all databases when a database is specified in the URL #7419

Closed
wangzhigang1999 wants to merge 5 commits into apache:master from wangzhigang1999:KYUUBI-7305

Conversation

@wangzhigang1999
Contributor

Why are the changes needed?

This is a bug fix.

The Hive JDBC driver rewrites a null schemaPattern to "%" in GetTables, and the JDBC engine forwards it directly into WHERE TABLE_SCHEMA LIKE '%'. As a result, DBeaver — which always passes null once a connection is established — sees tables from every database in the backend rather than only the one in its connection URL. Compounding this, the JDBC engine never applies the URL-level database to the underlying JDBC connection, so there is no "current database" on the backend for a filter to fall back to. The fix introduces a dialect-level setCurrentDatabase hook, tracks an effectiveDatabase on the session, and routes blank/"%" schema patterns through it in GetTablesOperation. MySQLDialect additionally implements getCatalogsOperation / getSchemasOperation with JDBC-standard column labels (TABLE_CATALOG / TABLE_SCHEM) so the schemas tree in DBeaver populates correctly.

Non-MySQL dialects inherit the default no-op setCurrentDatabase and effectiveDatabase stays None, so PostgreSQL / Oracle / etc. keep their existing behavior.
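The extension point described above could look roughly like the following. This is an illustrative Java sketch, not the actual Kyuubi code (which is Scala); only the names setCurrentDatabase and MySQLDialect come from the PR description.

```java
import java.sql.Connection;

// Illustrative sketch of the dialect-level hook: the default is a no-op
// returning false, so non-MySQL dialects keep their existing behavior.
interface JdbcDialect {
    default boolean setCurrentDatabase(Connection conn, String database) {
        return false; // no-op: engine records no effective database
    }
}

class MySQLDialect implements JdbcDialect {
    @Override
    public boolean setCurrentDatabase(Connection conn, String database) {
        // MySQL/StarRocks/Doris: switch the backend connection's database.
        try (java.sql.Statement stmt = conn.createStatement()) {
            stmt.execute("USE `" + database + "`");
            return true;
        } catch (java.sql.SQLException e) {
            return false; // caller decides whether the failure is tolerable
        }
    }
}
```

The boolean return lets the session keep effectiveDatabase only when the switch actually took effect.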

Note: getSchemas() continues to return all accessible schemas per the JDBC spec — the URL-level database only affects the default query context and the blank/"%" fallback in getTables. This matches the behavior of the native MySQL JDBC driver. Clients that want to restrict the visible schema list should do so via client-side filters (e.g. DBeaver's "Schemas" filter).
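The blank/"%" routing in GetTablesOperation can be sketched as a pure function (resolveSchemaPattern is a hypothetical name for illustration, not the Kyuubi method):

```java
// Sketch of the fallback: a blank or "%" schema pattern is routed to the
// session's effective database when one exists; otherwise the pattern is
// passed through unchanged (the previous no-filter behavior).
final class SchemaPatternRouting {
    static String resolveSchemaPattern(String schemaPattern, String effectiveDatabase) {
        boolean blankOrWildcard =
            schemaPattern == null
                || schemaPattern.trim().isEmpty()
                || "%".equals(schemaPattern);
        if (blankOrWildcard && effectiveDatabase != null) {
            return effectiveDatabase; // narrow to the URL-scoped database
        }
        return schemaPattern; // explicit patterns and no-database sessions untouched
    }
}
```

Explicit patterns like "info%" still reach the backend verbatim, which is why non-MySQL dialects see no behavior change.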

How was this patch tested?

Added unit / integration tests in MySQLOperationSuite covering:

  • getTables with a database in the URL returns only that database's tables
  • getTables with a "%" schema pattern is routed to the session's effective database
  • getSchemas returns results with JDBC-standard TABLE_CATALOG / TABLE_SCHEM column labels
  • getCatalogs behavior

Also verified end-to-end against a live StarRocks backend using Spark's Hive-JDBC beeline:

# URL with a database → only that database's tables
jdbc:hive2://127.0.0.1:10009/doctor              → !tables : 3 rows, all TABLE_SCHEMA=doctor
jdbc:hive2://127.0.0.1:10009/information_schema  → !tables : only information_schema views

# URL without a database → previous no-filter behavior preserved
jdbc:hive2://127.0.0.1:10009/                    → !tables : tables from every database

# DatabaseMetaData.getSchemas() returns JDBC-standard columns
columns: TABLE_CATALOG, TABLE_SCHEM
rows:    def/doctor, def/information_schema, def/sys, ...

Was this patch authored or co-authored using generative AI tooling?

Human-authored core design and implementation. Tests, verification, and PR description drafting were assisted by Claude Code.

Generated-by: Claude Code (Opus 4.7)

Closes #7305. Supersedes the stale #7307 per @pan3793's suggestion there.
Co-authored-by: zhzhenqin zhzhenqin@users.noreply.github.com

…ses when a database is specified in the URL

When DBeaver (or any Hive-JDBC client) connects to the Kyuubi JDBC engine with
a database path in the URL, DatabaseMetaData.getTables should return only the
tables from that database. Previously, the Hive JDBC driver converted a null
schemaPattern to "%" before sending the request, which the engine forwarded
straight into a WHERE TABLE_SCHEMA LIKE '%' clause on the backend, returning
tables from every database.

- Introduce JdbcDialect.setCurrentDatabase(connection, database) as an
  extension point; the default is a no-op returning false. MySQLDialect
  overrides it with USE `db` so MySQL/StarRocks/Doris switch the backend
  connection on session open.
- JdbcSessionImpl records the successfully applied database in
  effectiveDatabase and only keeps it when setCurrentDatabase returns true.
  "default" (the Hive JDBC driver's fallback when no database is specified in
  the URL) is tolerated on USE failure to stay compatible with backends that
  have no "default" database.
- JdbcOperationManager.newGetTablesOperation treats a blank schema or "%" as
  "use the session's effective database", routing through
  effectiveDatabase.orNull. Backends without an effective database
  (PostgreSQL, Oracle, etc.) keep the previous no-filter behavior.
- MySQLDialect now implements getCatalogsOperation and getSchemasOperation
  (previously featureNotSupported). getSchemasOperation aliases result columns
  to the JDBC-standard TABLE_CATALOG / TABLE_SCHEM so
  DatabaseMetaData.getSchemas works with clients such as DBeaver.

Closes apache#7305.
@wangzhigang1999 wangzhigang1999 marked this pull request as ready for review April 24, 2026 10:54
} else {
schemaName
}
val query = dialect.getTablesQuery(catalogName, effectiveSchema, tableName, tableTypes)
Member


I don't recall the details of the initial code of this part, but the API looks problematic.

The Hive Thrift GetTablesOperation API is derived from the JDBC API, which is why they look almost identical. For the JDBC engine, such a call should map directly to the JDBC API instead of issuing SQL (if you look at JDBC driver implementations, you may find some are non-SQL-based); the current design complicates the whole thing.

Contributor Author

@wangzhigang1999 wangzhigang1999 Apr 28, 2026


Thanks for raising this — you're right, and I want to confirm I took it seriously rather than hand-wave it away. To make sure I understood, I ran the full set of DatabaseMetaData calls against three different drivers (mysql-connector-j 8.4.0, postgresql 42.7.2, clickhouse-jdbc 0.6.5) using a small Java probe and docker containers. Test fixtures were tailored to each backend (MySQL/PG had PK + FK + a function; ClickHouse used MergeTree which has no native PK/FK).

Cross-driver behavior of each API:

| API | MySQL (databaseTerm=CATALOG) | MySQL (databaseTerm=SCHEMA) | PostgreSQL | ClickHouse |
| --- | --- | --- | --- | --- |
| getCatalogs() | ✓ implemented | ⚠ empty (driver databaseTerm selection) | ✓ implemented | ✓ implemented |
| getSchemas() | ⚠ empty (driver databaseTerm selection) | ✓ implemented | ✓ implemented | ✓ (model has no schema layer) |
| getTableTypes() | ✓ | ✓ | ✓ | ✓ |
| getTables(...) | ✓ | ✓ | ✓ | ✓ |
| getColumns(...) | ✓ | ✓ | ✓ | ✓ |
| getPrimaryKeys(...) | ✓ | ✓ | ✓ | ✓ empty (MergeTree has no PK) |
| getImported / Exported / CrossReference | ✓ | ✓ | ✓ | ✓ empty (no FK in CK) |
| getFunctions(...) | ✓ | ✓ | ✓ | ✓ empty (no fixture created) |
| getTypeInfo() | ✓ | ✓ | ✓ | ✓ |

Things that hold across all three drivers:

  • Every JDBC-standard metadata method is implemented — none threw SQLFeatureNotSupportedException; concepts that don't apply just return empty result sets. (The JDBC contract requires this; clients differ in how strictly they enforce it.)
  • Column names follow the JDBC spec (TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE, KEY_SEQ, PK_NAME, ...) — which is exactly what GetTablesOperation / GetColumnsOperation / etc. on the Hive Thrift side expect, since the Thrift API was derived from the JDBC API in the first place.
  • Driver-specific quirks are minor and well-defined — the databaseTerm selection is unique to mysql-connector-j; PG returns column labels in lowercase while MySQL/CK use uppercase (JDBC spec says case-insensitive, but engine-side string comparisons would need to match).

So the JDBC standard API really does cover every operation the dialect models today, across very different DBMS families. Concretely, this means the JdbcDialect.getXxxQuery family of methods could be deleted in favor of the engine calling connection.getMetaData.getXxx(...) directly — and dialects whose driver implements metadata via a non-SQL path (Avatica, etc.) would automatically be used the way they were designed to.

The few wrinkles I hit are all driver-side and resolvable in a few lines of engine-level adapter code, not by reaching for SQL:

  • getCatalogs / getSchemas are selected by mysql-connector-j's databaseTerm property (only one returns rows in a given mode) — handled by checking the property and routing, or by calling both and taking whichever is non-empty.
  • DATA_TYPE is int in the JDBC spec but the current SQL returns a string — java.sql.Types-to-name mapping is one helper.
  • The current SQL returns extra MySQL-specific columns (ENGINE, TABLE_ROWS, ...) but the Hive Thrift GetTables result schema is fixed to JDBC-standard columns, so those extras are already dropped during serialization today and aren't reaching clients — no real loss.
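For the DATA_TYPE wrinkle above, the one-helper mapping could lean on the standard java.sql.JDBCType enum (a sketch; DataTypeNames is a hypothetical name, not engine code):

```java
import java.sql.JDBCType;

// Map the integer type codes mandated by java.sql.Types to readable names,
// as the "java.sql.Types-to-name mapping is one helper" point suggests.
final class DataTypeNames {
    static String nameOf(int jdbcTypeCode) {
        try {
            return JDBCType.valueOf(jdbcTypeCode).getName();
        } catch (IllegalArgumentException e) {
            return "OTHER"; // vendor-specific code with no JDBCType constant
        }
    }
}
```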

I'd like to keep this PR scoped to the URL-database bug since that's what it set out to fix, and migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test — not something I'd want to fold into a bug fix. Happy to open a follow-up issue tracking the migration if you agree with that split. WDYT?

Member


migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test

if we do that first, do we still need this patch?

Member

@pan3793 pan3793 Apr 28, 2026


No SQLFeatureNotSupportedException anywhere

Actually, the JDBC API Javadocs define the behavior very clearly: a method should not throw SQLFeatureNotSupportedException unless its Javadocs say it can. Some tools, like DBeaver, may tolerate it when a JDBC driver vendor does not respect the API contract, but tools like DataGrip may not.

Member


BTW, we MUST avoid copying the code from mysql-connector-j, it's GPL-licensed ... try to read the docs or use black box testing to tackle the mysql-connector-j-related issue. (this will be a horrible experience ...)

Contributor Author


migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test

if we do that first, do we still need this patch?

Yes — the bug is the Hive JDBC client converting null schemaPattern to "%", and JDBC spec says "%" means "don't narrow by schema", so calling DatabaseMetaData.getTables(...) directly reproduces it. Verified on mysql-connector-j 8.4.0 (connection opened against jdbc:mysql://.../dbA):

conn.getCatalog() == "dbA"
getTables(null, "%",  "%", null) → tables from dbA AND dbB   ← bug reproduces
getTables("dbA", null, "%", null) → tables from dbA only     ← only narrows when explicit

So the engine still has to substitute the URL-scoped database into the call when the client sends null/"%". That routing logic (effectiveDatabase + the if-blank-then-route block in JdbcOperationManager) survives the refactor unchanged — only the line that builds a SQL string flips to conn.getMetaData.getTables(...). The setSchema-on-session-open machinery is also independent.

What the refactor does subsume: the two getCatalogsOperation / getSchemasOperation SQL implementations I added to MySQLDialect — those go away once the engine calls getCatalogs() / getSchemas() directly.

Happy either way on ordering — bug fix first then refactor, or refactor first and a smaller patch on top.

Contributor Author


BTW, we MUST avoid copying the code from mysql-connector-j, it's GPL-licensed ... try to read the docs or use black box testing to tackle the mysql-connector-j-related issue. (this will be a horrible experience ...)

Agreed, and thanks for catching this early. I've rewritten all my prior PR comments and the in-tree comment in MySQLDialect.setSchema to remove any reference / paraphrase of mysql-connector-j source — keeping only the documented databaseTerm property and black-box test observations. Also amended the FOLLOWUP commit message similarly. Will follow this rule going forward.

…tandard setCatalog/setSchema

Replace the hand-rolled `USE \`db\`` with `Connection.setCatalog` plus
`Connection.setSchema` in MySQLDialect.setCurrentDatabase. The MySQL
Connector/J `databaseTerm` connection property determines which of the two
calls actually switches the session database; invoking both ensures the
switch happens regardless of how the user configured the URL, while staying
on standard JDBC API instead of hand-rolled SQL. StarRocksDialect /
DorisDialect inherit this via MySQLDialect.

Also add three StarRocks regression tests covering: scoping when URL
specifies a database, no-filter behavior when URL has no database, and
session-open failure on a non-existent database.
…Dialect

Per review feedback, replicate the JDBC `Connection#setSchema(String)` API on
JdbcDialect with a leading `conn` parameter. The default forwards to the
standard API; dialects whose driver does not support it can override to a
no-op. Drop the boolean return value — callers treat the call as if it always
takes effect. JdbcSessionImpl now records `effectiveDatabase` unconditionally
after the call, decoupling "backend connection switched" from "engine
metadata scope".
…ession open

The Hive JDBC driver sends `USE_DATABASE = "default"` when the user did not
specify a database in the connection URL — a protocol-level stub
indistinguishable from a genuine request for a database literally named
"default". Forwarding that into `setSchema` plus recording it as
`effectiveDatabase` caused PG and ClickHouse metadata operations to scope to
a non-existent or unintended schema, breaking the existing
`get tables` suites in CI.

Filter `"default"` at the session entry point: skip both the dialect
`setSchema` call and the `effectiveDatabase` assignment for it. Drops the
prior try/catch that only existed to tolerate the failure path. Also rewords
the in-tree `MySQLDialect.setSchema` comment to describe the externally
observed behavior of the documented `databaseTerm` connection property,
without paraphrasing driver internals.

URL paths that name a real non-`default` database keep the new scoping
behavior; URL paths that explicitly name `default` continue to behave as
before this PR (no metadata scope), which is the same as URL paths with no
database — there is no observable regression.
@wangzhigang1999
Contributor Author

Where the patch stands now

Posting a wrap-up. Want to flag one decision up-front since it's the least obvious from the diff and the most likely thing to want a sanity-check.

Heads-up: how "default" is handled at session open

The Hive JDBC driver sends USE_DATABASE = "default" as a stub when the user did not specify a database in the URL — there is no protocol-level signal that distinguishes that stub from a genuine request for a database literally named default. After the FOLLOWUP that turned setSchema into the unconditional path (per your review feedback), this caused the PG and ClickHouse get tables suites to fail in CI: their drivers silently accept setSchema("default") (PG sets search_path to a non-existent schema, CH actually has a default database), so the engine then scoped metadata to a non-existent or unintended schema.

The fix is to filter "default" at the session entry point. The trade-off, made explicit:

| URL form | Before this PR | After this PR |
| --- | --- | --- |
| no database in URL | no metadata scope | no metadata scope (unchanged) |
| URL specifies a non-default database | no metadata scope (the original bug) | scoped to that database ✓ |
| URL explicitly names default (and the backend has one) | no metadata scope | no metadata scope (no change) |
No observable regression — the third row was already in this state before this PR (the engine never URL-scoped at all). Users on backends with a real default schema can still target it after connecting via USE default;, which goes through a different code path (newSetCurrentDatabaseOperation). I considered alternatives (try/catch around setSchema, or "verify the schema exists before recording it") but the first is brittle on PG/CH, and the second still does the wrong thing on ClickHouse since CH genuinely has a default database.

The rest of the patch, briefly

Three engine-side layers:

  1. Dialect API: JdbcDialect.setSchema(conn, schema): Unit mirrors Connection#setSchema with a leading conn parameter; the default forwards to the standard API. Only MySQLDialect overrides it, calling both setCatalog and setSchema so the database switch happens regardless of which databaseTerm connection property the user configured.
  2. Session open: invokes the dialect's setSchema for the URL-scoped database (modulo the "default" stub above) and records it as effectiveDatabase.
  3. Operation routing: JdbcOperationManager.newGetTablesOperation substitutes effectiveDatabase when the client sends a blank or "%" schemaPattern, undoing the Hive driver's null → "%" translation.
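The "calls both setCatalog + setSchema" behavior in layer 1 can be exercised without a live backend, e.g. with a dynamic proxy standing in for the Connection. This is an illustrative sketch (SetSchemaSketch and its method names are hypothetical), not the Kyuubi implementation:

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

final class SetSchemaSketch {
    // Invoke both calls: Connector/J's databaseTerm property decides which
    // one actually switches the session database, so issuing both covers
    // either configuration while staying on the standard JDBC API.
    static void setCurrentDatabase(Connection conn, String database) throws Exception {
        conn.setCatalog(database); // effective when databaseTerm=CATALOG
        conn.setSchema(database);  // effective when databaseTerm=SCHEMA
    }

    // A dynamic proxy records which Connection methods were invoked.
    static List<String> recordedCalls() throws Exception {
        List<String> calls = new ArrayList<>();
        Connection fake = (Connection) Proxy.newProxyInstance(
            Connection.class.getClassLoader(),
            new Class<?>[] {Connection.class},
            (proxy, method, args) -> {
                calls.add(method.getName());
                return null; // both recorded methods are void
            });
        setCurrentDatabase(fake, "doctor");
        return calls;
    }
}
```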

Tracked separately

Migrating the dialect API from "return a SQL string" to "engine calls DatabaseMetaData.getXxx(...) directly" — your point on the GetTablesOperation thread, with cross-driver verification confirming feasibility on MySQL / PG / CH. This PR's URL-database routing logic survives that refactor unchanged. Will open a separate issue.

@pan3793
Member

pan3793 commented Apr 29, 2026

Can we do the mapping to DatabaseMetaData.getXxx(...) first? I think this fix will be clearer after that: each Thrift request will exactly match the JDBC DatabaseMetaData API, with no need to transform the parameter value.

@wangzhigang1999
Contributor Author

Happy to do the refactor first if that's the preferred order. One thing to confirm though, since I want to make sure we're aligned: as I noted above, the URL-database routing (effectiveDatabase substitution when the client sends null / "%" schemaPattern) is still required after the refactor — verified empirically on mysql-connector-j 8.4.0, where conn.getMetaData.getTables(null, "%", "%", null) on a connection opened against /dbA still returns tables from all DBs. The refactor cleans up the SQL-string dialect methods but does not subsume this fix.

Given that, do you still prefer refactor-first, or merge this as the minimal bug fix and follow up with the refactor? Either order works for me.

@pan3793
Member

pan3793 commented Apr 29, 2026

the JDBC API Javadocs say

"" retrieves those without a schema; null means that the schema name should not be used to narrow the search

I interpret it to

  • "": filter with TABLE_SCHEM = <current schema>
  • null: retrieve all schemas
  • "%": filter with TABLE_SCHEM LIKE '%', same as null
  • others, actually include "%": filter with TABLE_SCHEM LIKE '<pattern>'

I don't understand why effectiveDatabase substitution is required here?
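The four-case interpretation above can be encoded as a small sketch (buildSchemaFilter and its WHERE-fragment output are illustrative names, not engine code):

```java
// Encode the schemaPattern semantics listed above as a WHERE-fragment
// builder; null output means "do not narrow by schema" (no clause).
final class SchemaFilter {
    static String buildSchemaFilter(String schemaPattern, String currentSchema) {
        if (schemaPattern == null || "%".equals(schemaPattern)) {
            return null; // retrieve all schemas
        }
        if (schemaPattern.isEmpty()) {
            return "TABLE_SCHEM = '" + currentSchema + "'"; // "": current schema
        }
        return "TABLE_SCHEM LIKE '" + schemaPattern + "'"; // explicit pattern
    }
}
```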

@pan3793
Member

pan3793 commented Apr 29, 2026

Note, some JDBC clients (e.g., Cloudera HUE) use * instead of % or null to match all patterns, which is not supported by the JDBC standard, but is supported by many RDBMS, and the implementations are various, either on the JDBC driver side or server side. IIRC, HiveServers and Kyuubi support * for HUE compatibility, but Spark Thrift Server does not.

@wangzhigang1999
Contributor Author

You're right, and I want to walk back my earlier claim that the substitution is required. I tested this end-to-end against master (without this PR) using DBeaver and a real MySQL backend, and traced what actually triggers the bug:

On master, MySQLDialect does not override getCatalogsOperation / getSchemasOperation, so the JdbcDialect defaults take over and throw KyuubiSQLException.featureNotSupported(). The result on the wire (DBeaver metadata refresh):

-- Load catalogs   → throws featureNotSupported
-- Load schemas    → throws featureNotSupported
-- Load tables [null, null, %, null]   ← falls back to "give me everything"
-- Load type info

So the root cause of KYUUBI-7305 is that DBeaver fails to obtain a database list and falls back to getTables(null, null, ...), which the Hive client rewrites to schemaPattern="%" and the engine, per spec, treats as "all schemas". On the same setup with this PR applied, DBeaver gets a real catalog/schema list and switches to issuing scoped calls like getTables(null, "dataeye_starpony", "%", null) on its own — no engine-side substitution needed.

That makes the effectiveDatabase substitution in this PR redundant in practice — your spec interpretation is correct, and the fix users actually need is the getCatalogsOperation / getSchemasOperation implementations plus setSchema at session open. I'll drop the substitution layer and keep the rest, then we can move on to the dialect-API refactor you outlined as the proper follow-up. Apologies for the runaround on this thread.

@pan3793
Member

pan3793 commented Apr 29, 2026

the fix users actually need is the getCatalogsOperation / getSchemasOperation implementations plus setSchema at session open.

yes, this meets my expectations exactly.

@wangzhigang1999
Contributor Author

Going to close this PR — the discussion drifted from the actual root cause, and the substitution mechanism here turned out to be unnecessary as you pointed out. Apologies for the noise on this thread; I should have traced the failure path empirically much earlier instead of arguing from incomplete reasoning, and I appreciate your patience walking me back to it.

I'll open a new PR taking the direction you've been suggesting all along: align the dialect API with JDBC semantics by migrating the hand-written SQL methods (getCatalogsOperation / getSchemasOperation / getTablesQuery / getColumnsQuery / etc.) to call connection.getMetaData.getXxx(...) directly, plus implement setSchema at session open so the URL-scoped database is honored on the connection. That's both more uniform and properly fixes KYUUBI-7305 — DBeaver gets a real catalog/schema list back instead of featureNotSupported, and stops falling back to getTables(null, null, ...).



Successfully merging this pull request may close these issues.

[Bug] DBeaver returned all tables in all databases when connecting.
